Feedback Delay


Online EXP3 Learning in Adversarial Bandits with Delayed Feedback

Ilai Bistritz, Zhengyuan Zhou, Xi Chen, Nicholas Bambos, Jose Blanchet

Neural Information Processing Systems

Consider a player that in each of T rounds chooses one of K arms. An adversary chooses the cost of each arm in a bounded interval, and a sequence of feedback delays {d_t} that are unknown to the player. After picking arm a_t at round t, the player receives the cost of playing this arm d_t rounds later. In cases where t + d_t > T, this feedback is simply missing.
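The delayed-feedback loop described above can be sketched with a toy exponential-weights (EXP3-style) learner. The learning rate eta, the cost/delay containers, and the simplified importance-weighted update are illustrative assumptions for the sketch, not the paper's exact algorithm or tuning:

```python
import math
import random

def exp3_delayed(K, T, eta, costs, delays):
    """EXP3-style learner where the cost of the arm played at round t
    arrives delays[t] rounds later.  costs[t] is a length-K list of
    costs in [0, 1]; feedback with t + d_t > T is simply lost."""
    weights = [1.0] * K
    pending = {}          # arrival round -> list of (arm, cost, prob)
    total_cost = 0.0
    for t in range(T):
        s = sum(weights)
        probs = [w / s for w in weights]
        # sample an arm from the exponential-weights distribution
        arm = random.choices(range(K), weights=probs)[0]
        total_cost += costs[t][arm]
        arrival = t + delays[t]
        if arrival < T:   # otherwise this feedback never arrives
            pending.setdefault(arrival, []).append(
                (arm, costs[t][arm], probs[arm]))
        # apply all feedback whose delay expires at round t
        for a, c, p in pending.pop(t, []):
            est = c / p   # importance-weighted cost estimate
            weights[a] *= math.exp(-eta * est)
    return total_cost
```

The key point the abstract makes is visible here: the sampling distribution at round t can only use feedback that has already arrived, so the update for round t is postponed by d_t rounds.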


Decentralized Online Convex Optimization with Unknown Feedback Delays

Qiu, Hao, Zhang, Mengxiao, Achddou, Juliette

arXiv.org Machine Learning

Decentralized online convex optimization (D-OCO), where multiple agents within a network collaboratively learn optimal decisions in real time, arises naturally in applications such as federated learning, sensor networks, and multi-agent control. In this paper, we study D-OCO under unknown, time- and agent-varying feedback delays. While recent work has addressed this problem (Nguyen et al., 2024), existing algorithms assume prior knowledge of the total delay over agents and still suffer from suboptimal dependence on both the delay and network parameters. To overcome these limitations, we propose a novel algorithm that achieves an improved regret bound of O(N√d̄_tot + N√T/(1−σ₂)^(1/4)), where T is the total horizon, d̄_tot denotes the average total delay across agents, N is the number of agents, and 1−σ₂ is the spectral gap of the network. Our approach builds upon recent advances in D-OCO (Wan et al., 2024a), but crucially incorporates an adaptive learning rate mechanism via a decentralized communication protocol. This enables each agent to estimate delays locally using a gossip-based strategy without prior knowledge of the total delay. We further extend our framework to the strongly convex setting and derive a sharper regret bound of O(N δ̄_max ln T / α), where α is the strong convexity parameter and δ̄_max is the maximum number of missing observations averaged over agents. We also show that our upper bounds for both settings are tight up to logarithmic factors. Experimental results validate the effectiveness of our approach, showing improvements over existing benchmark algorithms.
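The gossip-based local estimation the abstract mentions can be illustrated with a minimal synchronous averaging step. The doubly stochastic mixing matrix W, the plain-list state, and the ring topology in the usage note are assumptions for this sketch, not the paper's protocol:

```python
def gossip_step(values, W):
    """One synchronous gossip-averaging step: each agent replaces its
    local estimate (e.g. an observed delay count) with a weighted
    average of its neighbors' estimates, using a doubly stochastic
    mixing matrix W.  Repeating the step drives all agents toward the
    network-wide average at a rate governed by the spectral gap."""
    N = len(values)
    return [sum(W[i][j] * values[j] for j in range(N)) for i in range(N)]
```

For example, on a 4-agent ring with self-weight 1/2 and neighbor weight 1/4, repeated steps converge geometrically to the average of the initial local estimates, which is how each agent can approximate a network-wide delay statistic without any central coordinator.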



A Illustration of RCL

Neural Information Processing Systems

We illustrate the online optimization process of RCL in Figure 1. We set b = 10 and A = I for the cost function in Eqn. The testing process is almost instant and takes less than 1 second. It does not use robustification during online optimization. By Theorem 4.1, there is a trade-off (governed by λ) between robustness and the quality of the ML predictions for those problem instances that are adversarial to ROBD.






Robust Learning for Smoothed Online Convex Optimization with Feedback Delay

Neural Information Processing Systems

We study a general form of Smoothed Online Convex Optimization, a.k.a. SOCO. We propose a novel machine learning (ML) augmented online algorithm, Robustness-Constrained Learning (RCL), which combines untrusted ML predictions with a trusted expert online algorithm via constrained projection to robustify the ML prediction. Specifically, we prove that RCL is able to guarantee (1+λ)-competitiveness against any given expert for any λ > 0, while also explicitly training the ML model in a robustification-aware manner to improve the average-case performance. Importantly, RCL is the first ML-augmented algorithm with a provable robustness guarantee in the case of multi-step switching cost and feedback delay. We demonstrate the improvement of RCL in terms of both robustness and average performance using battery management as a case study.
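The constrained-projection idea, keeping the ML action but never letting it stray farther from the trusted expert's action than a robustness budget allows, can be sketched in one dimension. The scalar actions and the fixed radius standing in for the λ-dependent budget are simplifying assumptions, not the paper's construction:

```python
def robustify(ml_action, expert_action, radius):
    """Project the untrusted ML action onto the interval of the given
    radius around the trusted expert's action.  If the ML prediction
    already lies inside the interval it is used as-is; otherwise it is
    clipped, which is what bounds the loss relative to the expert."""
    lo, hi = expert_action - radius, expert_action + radius
    return min(max(ml_action, lo), hi)
```

The design choice matters for the average case too: when the ML model is trained aware of this clipping (as the abstract describes), it learns to propose actions that survive the projection rather than actions that are silently overridden.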


Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

Chen, Ziqun, Cai, Kechao, Chen, Zhuoyue, Zhang, Jinbei, Lui, John C. S.

arXiv.org Machine Learning

We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing and online advertising, where feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays, where the feedback delays are independent of the rewards, and reward-dependent delays, where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms to select arms under unrestricted feedback delays based on their merits. We prove that our algorithms all achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.
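The merit-based selection rule can be sketched as sampling each arm with probability proportional to a merit function of its estimated reward, so that no arm with positive merit is ever starved. The affine merit function below is an illustrative assumption, not the paper's choice:

```python
def fair_selection_probs(est_rewards, merit=lambda r: 1.0 + r):
    """Merit-based fair selection: each arm is picked with probability
    proportional to a merit function of its estimated mean reward.
    Because the merit function is positive, every arm keeps a nonzero
    selection probability, which is the fairness property in play."""
    merits = [merit(r) for r in est_rewards]
    total = sum(merits)
    return [m / total for m in merits]
```

Under delayed feedback, the estimated rewards fed into this rule lag behind the true means, which is why the regret bounds above depend on quantiles of the delay distribution.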